Use GPUArrays accumulation implementation #2813
base: master
Conversation
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:

diff --git a/perf/array.jl b/perf/array.jl
index 3dbab9816..400de2231 100644
--- a/perf/array.jl
+++ b/perf/array.jl
@@ -54,11 +54,11 @@ let group = addgroup!(group, "reverse")
group["1d"] = @async_benchmarkable reverse($gpu_vec)
group["1dL"] = @async_benchmarkable reverse($gpu_vec_long)
group["2d"] = @async_benchmarkable reverse($gpu_mat; dims=1)
- group["2dL"] = @async_benchmarkable reverse($gpu_mat_long; dims=1)
+ group["2dL"] = @async_benchmarkable reverse($gpu_mat_long; dims = 1)
group["1d_inplace"] = @async_benchmarkable reverse!($gpu_vec)
group["1dL_inplace"] = @async_benchmarkable reverse!($gpu_vec_long)
group["2d_inplace"] = @async_benchmarkable reverse!($gpu_mat; dims=1)
- group["2dL_inplace"] = @async_benchmarkable reverse!($gpu_mat_long; dims=2)
+ group["2dL_inplace"] = @async_benchmarkable reverse!($gpu_mat_long; dims = 2)
end
group["broadcast"] = @async_benchmarkable $gpu_mat .= 0f0
diff --git a/test/runtests.jl b/test/runtests.jl
index b6c479cce..89bf840c9 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -5,7 +5,7 @@ using Printf: @sprintf
using Base.Filesystem: path_separator
using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="accumulatetests")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "accumulatetests")
# parse some command-line arguments
function extract_flag!(args, flag, default=nothing; typ=typeof(default))
CUDA.jl Benchmarks
| Benchmark suite | Current: 3c02fa9 | Previous: 205c238 | Ratio |
|---|---|---|---|
| latency/precompile | 43098463154.5 ns | 42934926801 ns | 1.00 |
| latency/ttfp | 7012905021 ns | 7008552789 ns | 1.00 |
| latency/import | 3574306668 ns | 3569139582 ns | 1.00 |
| integration/volumerhs | 9610435 ns | 9606581 ns | 1.00 |
| integration/byval/slices=1 | 147160 ns | 147311 ns | 1.00 |
| integration/byval/slices=3 | 426070 ns | 426127 ns | 1.00 |
| integration/byval/reference | 145095 ns | 145282 ns | 1.00 |
| integration/byval/slices=2 | 286522 ns | 286537 ns | 1.00 |
| integration/cudadevrt | 103592 ns | 103674 ns | 1.00 |
| kernel/indexing | 14293 ns | 14638.5 ns | 0.98 |
| kernel/indexing_checked | 14958 ns | 15045 ns | 0.99 |
| kernel/occupancy | 720.3851351351351 ns | 669.9465408805031 ns | 1.08 |
| kernel/launch | 2162.222222222222 ns | 2202.4444444444443 ns | 0.98 |
| kernel/rand | 18437 ns | 17466 ns | 1.06 |
| array/reverse/1d | 20190 ns | 20143 ns | 1.00 |
| array/reverse/2d | 23777 ns | 24692 ns | 0.96 |
| array/reverse/1d_inplace | 10893 ns | 11332 ns | 0.96 |
| array/reverse/2d_inplace | 13309 ns | 13662 ns | 0.97 |
| array/copy | 21111 ns | 21281 ns | 0.99 |
| array/iteration/findall/int | 118061 ns | 159966.5 ns | 0.74 |
| array/iteration/findall/bool | 98917 ns | 141602 ns | 0.70 |
| array/iteration/findfirst/int | 158577.5 ns | 163419 ns | 0.97 |
| array/iteration/findfirst/bool | 159266.5 ns | 165377 ns | 0.96 |
| array/iteration/scalar | 73974 ns | 76152 ns | 0.97 |
| array/iteration/logical | 175055 ns | 219912.5 ns | 0.80 |
| array/iteration/findmin/1d | 47340 ns | 47580 ns | 0.99 |
| array/iteration/findmin/2d | 96420 ns | 97060 ns | 0.99 |
| array/reductions/reduce/Int64/1d | 46877 ns | 43742.5 ns | 1.07 |
| array/reductions/reduce/Int64/dims=1 | 53196 ns | 47519.5 ns | 1.12 |
| array/reductions/reduce/Int64/dims=2 | 62497.5 ns | 62503 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 89099 ns | 89134 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 90082.5 ns | 87634.5 ns | 1.03 |
| array/reductions/reduce/Float32/1d | 34719 ns | 35637 ns | 0.97 |
| array/reductions/reduce/Float32/dims=1 | 51741 ns | 51967.5 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 59582 ns | 59824 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 52550 ns | 52680 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 70184 ns | 70568 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 46053 ns | 43514 ns | 1.06 |
| array/reductions/mapreduce/Int64/dims=1 | 53641.5 ns | 46605.5 ns | 1.15 |
| array/reductions/mapreduce/Int64/dims=2 | 63304.5 ns | 62143.5 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=1L | 89158 ns | 89174 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 87207.5 ns | 87305.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 34806 ns | 35464 ns | 0.98 |
| array/reductions/mapreduce/Float32/dims=1 | 48546 ns | 42505.5 ns | 1.14 |
| array/reductions/mapreduce/Float32/dims=2 | 59774 ns | 60252 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1L | 52862 ns | 52803 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 70614 ns | 70795 ns | 1.00 |
| array/broadcast | 20552 ns | 20737 ns | 0.99 |
| array/copyto!/gpu_to_gpu | 11319 ns | 13192 ns | 0.86 |
| array/copyto!/cpu_to_gpu | 215254.5 ns | 217123 ns | 0.99 |
| array/copyto!/gpu_to_cpu | 283817 ns | 287100 ns | 0.99 |
| array/accumulate/Int64/1d | 80265 ns | 126109 ns | 0.64 |
| array/accumulate/Int64/dims=1 | 220793 ns | 84201 ns | 2.62 |
| array/accumulate/Int64/dims=2 | 112332 ns | 158968 ns | 0.71 |
| array/accumulate/Int64/dims=1L | 410035 ns | 1710638 ns | 0.24 |
| array/accumulate/Int64/dims=2L | 5155424 ns | 967410.5 ns | 5.33 |
| array/accumulate/Float32/1d | 55731 ns | 109994 ns | 0.51 |
| array/accumulate/Float32/dims=1 | 201773 ns | 81343 ns | 2.48 |
| array/accumulate/Float32/dims=2 | 92523 ns | 148659 ns | 0.62 |
| array/accumulate/Float32/dims=1L | 245125 ns | 1619411 ns | 0.15 |
| array/accumulate/Float32/dims=2L | 3735231 ns | 699433 ns | 5.34 |
| array/construct | 1260.9 ns | 1288.5 ns | 0.98 |
| array/random/randn/Float32 | 47976 ns | 45344 ns | 1.06 |
| array/random/randn!/Float32 | 24949 ns | 25330 ns | 0.98 |
| array/random/rand!/Int64 | 27300 ns | 27554 ns | 0.99 |
| array/random/rand!/Float32 | 8829 ns | 8908.333333333334 ns | 0.99 |
| array/random/rand/Int64 | 30165 ns | 30218 ns | 1.00 |
| array/random/rand/Float32 | 13153 ns | 13361 ns | 0.98 |
| array/permutedims/4d | 60598.5 ns | 60397 ns | 1.00 |
| array/permutedims/2d | 54811 ns | 54394 ns | 1.01 |
| array/permutedims/3d | 55558 ns | 55362 ns | 1.00 |
| array/sorting/1d | 2760989 ns | 2758561 ns | 1.00 |
| array/sorting/by | 3368803.5 ns | 3368461 ns | 1.00 |
| array/sorting/2d | 1088682 ns | 1089562 ns | 1.00 |
| cuda/synchronization/stream/auto | 1027.6 ns | 1066.6 ns | 0.96 |
| cuda/synchronization/stream/nonblocking | 7564.6 ns | 7691.3 ns | 0.98 |
| cuda/synchronization/stream/blocking | 815.8111111111111 ns | 844.0121951219512 ns | 0.97 |
| cuda/synchronization/context/auto | 1153.8 ns | 1211.4 ns | 0.95 |
| cuda/synchronization/context/nonblocking | 8424.400000000001 ns | 6881.1 ns | 1.22 |
| cuda/synchronization/context/blocking | 894.8888888888889 ns | 924.7692307692307 ns | 0.97 |
This comment was automatically generated by a workflow using github-action-benchmark.
Well, that's a bit all over the place.
[only benchmarks]
Indeed. Hacking the big mapreduce kernel's heuristic for the by-threads or by-block decision into AK, we recover most of the performance discrepancy. It's still a regression, but on a 3090 the […]
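Roughly, the by-threads vs by-block decision referred to here has the following shape; this is a minimal sketch with made-up names and thresholds, not the actual CUDA.jl or AK heuristic:

```julia
# Illustrative sketch only: pick whether each reduced slice is handled
# serially by a single thread, or cooperatively by a whole block.
# The threshold values below are made up, not the ones CUDA.jl uses.
function reduction_strategy(slice_length::Int, n_slices::Int;
                            max_threads::Int = 1024, min_blocks::Int = 64)
    if n_slices >= min_blocks * max_threads && slice_length <= 32
        # Many short, independent slices: one thread per slice keeps the GPU
        # saturated without paying for intra-block communication.
        return :by_thread
    else
        # Few or long slices: dedicate a block per slice so the reduction
        # itself is parallelized (shared memory / warp shuffles).
        return :by_block
    end
end

reduction_strategy(16, 1_000_000)  # => :by_thread
reduction_strategy(1_000_000, 16)  # => :by_block
```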
That's too bad, and very much at odds with the results I've seen presented on AK.jl at e.g. JuliaCon. I guess the reduction kernel wasn't really optimized properly yet (the paper seems to focus on sorting operations).
I suspect the […]
@christiangnrd do you think it's worthwhile to open an issue at AK.jl about this (if one isn't open already)?
Good idea. I opened #60 |
Do I read this correctly? For accumulate:
In the other PR (#2815) for mapreduce:
This is the same trend as the timings I posted when first implementing N-dimensional reductions (JuliaGPU/AcceleratedKernels.jl#6 (comment)); AK-0.1 didn't have […]

@christiangnrd is right, we should definitely improve the heuristic for switching between the by-thread and by-block algorithms. For the innermost reduction kernel, though, the CUDA.jl algorithm should be superior, and until we have warp sizes and shuffle instructions exposed in KernelAbstractions I don't think we can do much better (implementation and notes here). What is better (and original, afaik) in the AK mapreduce is that it avoids recursive memory allocations when multiple kernel launches are needed (it switches views into different ends of the same vector here), so memory consumption is bounded and known upfront. I'll use the L cases for Nd mapreduce to investigate bottlenecks...
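As an aside, that bounded-scratch trick (one pre-allocated buffer, with each pass writing its partial results into a view at the other end) can be sketched as follows. This is a CPU-only illustration with the per-pass kernel replaced by a serial loop, not the actual AcceleratedKernels implementation:

```julia
# CPU-only sketch of a multi-pass reduction with bounded, up-front scratch:
# each pass writes its partial results into a view of the *same* buffer,
# ping-ponging between its two ends instead of allocating per pass.
# Not the actual AK code; assumes a non-empty input.
function reduce_bounded_scratch(op, v::AbstractVector{T}; group::Int = 256) where {T}
    npartials = cld(length(v), group)
    # Total scratch requirement is known immediately: the first pass needs
    # `npartials` slots, every later pass at most `cld(npartials, group)`.
    scratch = Vector{T}(undef, npartials + cld(npartials, group))

    src = v
    dst = view(scratch, 1:npartials)
    lo  = npartials                  # offset of the region used by the next pass
    while true
        # Stand-in for the kernel: collapse each group of `group` elements.
        for (i, r) in enumerate(Iterators.partition(eachindex(src), group))
            dst[i] = reduce(op, view(src, r))
        end
        length(dst) == 1 && return dst[1]
        # The partials just written become the next pass's input; its outputs
        # go into the other end of the same scratch buffer (no new allocation).
        n   = cld(length(dst), group)
        src = dst
        dst = view(scratch, lo + 1:lo + n)
        lo  = lo == npartials ? 0 : npartials
    end
end

reduce_bounded_scratch(+, collect(1:10_000)) == sum(1:10_000)  # true
```

The point is that the scratch size is computed once from the input length, so allocation cost and memory use stay fixed no matter how many passes are needed.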
Opened to run benchmarks.
Todo: